Comparative Analysis of DNA Word Abundances in Four Yeast Genomes Using a Novel Statistical Background Model

Authors

  • Ramkumar Hariharan
  • Reji Simon
  • M. Radhakrishna Pillai
  • Todd D. Taylor
Abstract

Previous studies have shown that identifying and analyzing both abundant and rare k-mers ("DNA words of length k") in genomic sequences with suitable statistical background models can reveal biologically significant sequence elements. Other studies have investigated the unimodal or multimodal distribution of k-mer abundances ("k-mer spectra") in different DNA sequences. However, existing background models are affected to varying extents by compositional bias. Moreover, the distribution of k-mer abundances across related genomes has not been studied previously. Here, we present a novel statistical background model for calculating k-mer enrichment in DNA sequences, based on the average of the frequencies of the two (k-1)-mers contained in each k-mer. Comparison of our null model with commonly used ones, including Markov models of different orders and the single mismatch model, shows that our method is more robust to AT-rich compositional bias and detects many additional repeat-poor, over-abundant k-mers that are biologically meaningful. Analysis of overrepresented genomic k-mers (4≤k≤16) from four yeast species using this model showed that the fraction of overrepresented DNA words falls linearly as k increases; however, a significant number of overabundant k-mers exist at higher values of k. Finally, comparative analysis of k-mer abundance scores across the four yeast species revealed a mixture of unimodal and multimodal spectra for the various genomic sub-regions analyzed.
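
The abstract describes the background model only in words, so the following Python sketch is one possible reading of it, not the authors' implementation. It assumes that the "two (k-1)-mers" are the prefix and suffix of each k-mer, and that the enrichment score is the observed k-mer frequency divided by the mean of those two (k-1)-mer frequencies; the score definition and the function names `kmer_counts` and `enrichment_scores` are illustrative assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping words with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def enrichment_scores(seq, k):
    """Score each observed k-mer against a (k-1)-mer based background.

    ASSUMPTION: the background is the mean frequency of the k-mer's
    prefix and suffix (k-1)-mers, and the score is observed / background.
    The abstract does not state the exact formula or normalization.
    """
    k_counts = kmer_counts(seq, k)
    sub_counts = kmer_counts(seq, k - 1)
    total_k = sum(k_counts.values())
    total_sub = sum(sub_counts.values())

    scores = {}
    for word, n in k_counts.items():
        observed = n / total_k
        prefix_freq = sub_counts[word[:-1]] / total_sub
        suffix_freq = sub_counts[word[1:]] / total_sub
        background = (prefix_freq + suffix_freq) / 2
        scores[word] = observed / background
    return scores


if __name__ == "__main__":
    # Toy sequence for illustration only; not data from the paper.
    toy = "ACGTACGTTTACGAACGT"
    for word, score in sorted(enrichment_scores(toy, 3).items()):
        print(word, round(score, 3))
```

Because k-mer and (k-1)-mer frequencies are normalized over different numbers of possible words, the raw ratio here is only meaningful for ranking words of the same length k; any thresholding for "overrepresented" words would need the calibration described in the full paper.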

Similar resources

Evaluation of First and Second Markov Chains Sensitivity and Specificity as Statistical Approach for Prediction of Sequences of Genes in Virus Double Strand DNA Genomes

The growing amount of information on biological sequences has made the application of statistical approaches necessary for modeling and estimating their functions. In this paper, the sensitivity and specificity of the first and second Markov chains for gene prediction were evaluated using complete double-stranded DNA virus genomes. There were two approaches for prediction of each Markov Model parameter,...

Comparative statistical analysis of bacteria genomes in "word" context

Statistical analysis of bacterial genomes has been performed on the basis of the texts of 20 complete genomes from GenBank. It has been revealed that the ranked word distributions are quite well approximated by a logarithmic law. The results obtained in the absent-word investigation show the considerably nonrandom character of DNA texts. In character of autocorrelation function behavior in sever...

Functional Investigation of the Novel BRCA1 variant (Glu1661Gly) by Computational Tools and Yeast Transcription Activation Assay

Introduction: Mutations in the BRCA1 gene are major risk factors for breast and ovarian cancers. However, the relationship between some BRCA1 mutations and cancer risk remains largely unknown. Cancer risk predictions could be improved by evaluation of the impairment degree in the BRCA1 functions due to a specific mutation. This study aimed to assess the functional effect of a novel variant (Glu...

Comparative bioinformatics analysis of a wild diploid Gossypium with two cultivated allotetraploid species

Background: Gossypium thurberi is a wild diploid species that has been used to improve cultivated allotetraploid cotton. G. thurberi belongs to the D genome, which is an important wild bio-source for cotton breeding and genetic research. To a certain degree, chloroplast DNA sequence information is a versatile tool for species identification and phylogenetic implications in plants. Different ch...

Journal title:

Volume 8, Issue 

Pages  -

Publication date: 2013